Abstract



Most recent cache designs use direct-mapped caches to provide the fast access
time required by modern high speed CPUs. Unfortunately, direct-mapped caches
have higher miss rates than set-associative caches, largely because
direct-mapped caches are more sensitive to conflicts between items needed
frequently in the same phase of program execution.

This paper presents a new technique for reducing direct-mapped cache 
misses caused by conflicts for a particular cache line. A small finite state 
machine recognizes the common instruction reference patterns where storing 
an instruction in the cache actually harms performance. Such instructions 
are dynamically excluded; that is, they are passed directly through the cache
without being stored. This reduces misses to the instructions that would have 
been replaced. 

The effectiveness of dynamic exclusion is dependent on the severity of 
cache conflicts and thus on the particular program and cache size of interest. 
However, across the SPEC benchmarks, simulation results show an average 
reduction in miss rate of 35% for a 32KB instruction cache. In addition, 
applying dynamic exclusion to one level of a cache hierarchy can improve the 
performance of the next level since instructions do not need to be stored on 
both levels. Finally, dynamic exclusion also improves combined instruction 
and data cache miss rates. 

1 Introduction



In recent years, the dramatic rise of CPU speed has increasingly stressed memory system 
performance. Even though DRAM chips are now much denser, their speed has not kept 
pace with CPU cycle times. This trend increases the importance of cache design to provide 
the instruction and data bandwidth required by modern CPUs. Furthermore, short CPU
cycle times often require both instruction and data caches to be on the same chip as the CPU
since crossing chip boundaries leads to unacceptable cache access times. On-chip caches
must necessarily be small. These factors lead to caches with relatively high miss rates and
large miss penalties. 

For a given cache size, set-associative caches have a significantly lower miss rate than 
direct-mapped caches. In a set-associative cache, every memory item can be stored in one of 
multiple cache lines. Thus, any two items can be simultaneously stored in a set-associative 
cache. In a direct-mapped cache, each item can be stored in only one cache line. Thus, 
a direct-mapped cache can have many more misses if two items are needed repeatedly in 
the same phase of program execution and they both must be stored in the same cache line. 
In spite of this issue, direct-mapped caches often have better overall performance because 
they have lower access times. 

This paper presents a new hardware technique that improves the miss rate of direct- 
mapped caches, particularly instruction caches. When two instructions compete for the 
same cache line, the technique attempts to keep one instruction in the cache and the other 
instruction out of the cache. Thus, if execution alternates between the two instructions, the 
miss rate is halved. We call this selection of instructions to keep out of the cache dynamic 
exclusion. 

The key to dynamic exclusion is the recognition of the common instruction execution 
patterns. These patterns are typically determined by the loops in the program. Section 3 
describes the common loop structures, the resulting execution patterns, and the optimal 
replacement policy for these patterns. Section 4 describes a simple finite state machine that 
can recognize these patterns and guide a direct-mapped cache toward the optimal replacement
policy. Section 5 discusses an alternate finite state machine with better performance
for small caches where the reference patterns are slightly different. Section 6 describes the 
implications of dynamic exclusion on the next level of the cache hierarchy. In particular, 
it discusses how to handle misses in the next level cache and how dynamic exclusion can 
reduce these misses. Section 7 describes the interaction between dynamic exclusion and 
caches with block sizes larger than one instruction. Section 8 discusses using dynamic 
exclusion on data and combined caches. Finally, Section 9 gives some concluding remarks. 
First, however, we begin with a short discussion of related work in the area of cache design.

2 Related Work



The design of caches to improve memory system performance has been studied extensively. 
For an overview, see Smith [Smi82]. One area of strong interest has been cache organization.
In this paper, we assume caches are direct mapped. Several studies have shown that direct-mapped
caches have better overall performance than set-associative caches because of their
lower access times [Prz88, PHH89, PHH88, Hil87].

Many cache replacement policies have also been studied. Belady [Bel66] gave an optimal
replacement policy in the context of page replacement for virtual memories. Belady's
algorithm uses future information to establish a theoretical upper bound on the performance
of any other replacement policy. This paper attempts to predict the future and match
optimal replacement by recognizing the execution patterns caused by loops. Smith and
Goodman [SG85] also used a loop model to compare the effectiveness of different instruction
cache replacement policies. Loop models have also been used to improve cache
performance with compiler techniques [HC89, PH90, McF89, WL91].

Jouppi [Jou90] also studied a hardware method of reducing the sensitivity of direct- 
mapped caches to conflicts. A small second level associative cache, called a victim cache, 
was used to avoid pathological conflicts between two frequently accessed items that use 
the same cache line. Victim caches work well for data references where the number of 
conflicting items may be small. For instruction references, there are usually many more 
conflicting items than a victim cache can hold. This is where dynamic exclusion is most 
effective. 

3 Common Instruction Reference Patterns 

To understand how direct-mapped cache performance can be improved, we need to look at 
the misses that occur for common instruction reference patterns. For now, we assume that 
each cache miss brings one instruction into the cache. Typically, misses are caused by
interference between pairs of instructions, known as conflicting instructions, that cannot
both be stored in the cache at the same time. Conflicts between three or more instructions
also occur, but less frequently. Normally, when execution alternates between two conflicting
instructions, there are two misses as each instruction must be brought into the cache. Our
goal is to change the replacement policy so that whenever possible there will only be one 
miss. To explain how this can be done, we will examine the three most common sources of 
instruction conflicts: 

1. conflict between instructions in two different loops

2. conflict between an instruction inside a loop and an instruction outside the loop

3. conflict between two instructions within the same loop

To guide our choice of a new replacement policy, we will compare a conventional direct-mapped
cache with an optimal direct-mapped cache. By optimal direct-mapped cache, we
mean that the cache stores instructions in the same place that a direct-mapped cache would, 
but the cache has an optimal replacement policy. With this optimal policy, the cache retains 
the instructions that will be used soonest in the future among those instructions that map to 
each location in the cache. Furthermore, we assume an instruction can be passed directly to 
the CPU without ever being stored in the cache. This allows the instruction in the cache to 
be retained if it will be used sooner in the future than the current instruction will be needed 
again. 

3.1 Conflict Between Loops 

An example of conflict between instructions in different loops is shown below,
where instructions a and b map to the same cache location.

    for i = 1 to 10
        for j = 1 to 10
            instruction a
        for j = 1 to 10
            instruction b



Ignoring the instructions associated with the for loops, this example has the execution
sequence:

$(a^{10}\, b^{10})^{10}$

In this paper, exponents give the number of times a subsequence is repeated. The
subscripts h and m refer to instructions that hit or miss respectively. For the above
sequence, both conventional and optimal direct-mapped caches have the behavior:

$(a_m\, a_h^{9}\, b_m\, b_h^{9})^{10}$

with miss rates:

$m_{DM} = m_{OPTDM} = 10\%$

Every time an instruction is executed, it is either already in the cache or should be placed 
in the cache because it is also the next instruction to be executed. Thus, a conventional 
direct-mapped cache already has optimal performance. Any new replacement strategy for 
direct-mapped caches should not change this performance. 

3.2 Conflict Between Loop Levels

The following example shows a conflict between an instruction inside a loop and another
instruction outside the loop:

    for i = 1 to 10
        for j = 1 to 10
            instruction a
        instruction b

Here, the behavior of a conventional direct-mapped cache is:

$(a_m\, a_h^{9}\, b_m)^{10}$

$m_{DM} = 18\%$

The behavior of an optimal direct-mapped cache is:

$a_m\, a_h^{9}\, b_m\, (a_h^{10}\, b_m)^{9}$

$m_{OPTDM} = 10\%$

In the conventional direct-mapped cache, each access to b causes two misses. Not only 
does instruction b miss, b knocks a out of the cache and causes a to miss the next time 
it is executed. In the optimal cache, instruction a is kept in the cache even when b is 
executed. Instruction a will only miss once. To achieve the optimal cache behavior, a new 
replacement method should recognize instructions that will only be executed once before 
some other instruction is repeated, and not add such instructions to the cache. 

3.3 Conflict within Loops 

The final example below illustrates the behavior when two instructions within a single loop
compete for space in the cache.

    for i = 1 to 10
        instruction a
        instruction b

Here, the behavior of a conventional direct-mapped cache is:

$(a_m\, b_m)^{10}$

$m_{DM} = 100\%$

The behavior of an optimal direct-mapped cache is:

$a_m\, b_m\, (a_h\, b_m)^{9}$

$m_{OPTDM} = 55\%$

In a conventional cache, both instructions continuously knock each other out of the 
cache. Neither hits. Conversely, the optimal cache selects one instruction to keep in the 
cache. This selected instruction will hit on later executions. To improve on the conventional 
direct-mapped cache, the replacement policy should recognize when two instructions are 
alternating and select one to be kept in the cache. 
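
The miss rates quoted in this section can be checked with a short simulation of a single cache line. The Python sketch below is illustrative only: the function names and the list encoding of the reference sequences are assumptions, not part of the original study. The optimal policy is modeled as described above, so that on a miss the incoming instruction replaces the resident one only if it will be needed sooner.

```python
def next_use(refs, i, item):
    """Index of the next reference to item after position i (infinity if none)."""
    for j in range(i + 1, len(refs)):
        if refs[j] == item:
            return j
    return float("inf")


def conventional_misses(refs):
    """Conventional direct-mapped line: every miss replaces the resident."""
    resident, misses = None, 0
    for x in refs:
        if x != resident:
            misses += 1
            resident = x
    return misses


def optimal_misses(refs):
    """Optimal direct-mapped line: a missing instruction may bypass the cache."""
    resident, misses = None, 0
    for i, x in enumerate(refs):
        if x == resident:
            continue
        misses += 1
        # Store x only if it is needed again before the resident instruction.
        if resident is None or next_use(refs, i, x) < next_use(refs, i, resident):
            resident = x
    return misses


patterns = {
    "between loops": (["a"] * 10 + ["b"] * 10) * 10,
    "between loop levels": (["a"] * 10 + ["b"]) * 10,
    "within a loop": ["a", "b"] * 10,
}

for name, refs in patterns.items():
    dm = 100 * conventional_misses(refs) / len(refs)
    opt = 100 * optimal_misses(refs) / len(refs)
    print(f"{name}: conventional {dm:.0f}%, optimal {opt:.0f}%")
```

Under these assumptions the three patterns give 10% and 10%, 18% and 10%, and 100% and 55%, matching the rates derived above.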

4 Dynamic Exclusion Replacement Policy 

We now present a new method for improving direct-mapped cache performance. The 
basic idea is to recognize the patterns presented in the previous section and mimic the 
behavior of an optimal direct-mapped cache. The recognition process treats each cache line 
independently. In this section, we assume each cache line can hold a single instruction at a 
time. Section 7 will generalize the result to larger cache line sizes. 

In a conventional direct-mapped cache, the most recent reference is always placed in 
the cache. As we saw in Section 3, optimal direct-mapped caches can get fewer misses by 
excluding some instructions from the cache. We now consider a new cache replacement 
policy that reduces the number of misses by dynamically determining which instructions 
should be excluded from the cache. The determination uses a simple finite-state machine 
(FSM) in conjunction with two new state bits associated with each cache line. These new 
bits are called sticky and hit-last and are denoted s and h[ ] respectively. Figure 1 shows 
one cache organization that stores these new state bits. The Level 1 (L1) cache contains
both bits. The Level 2 (L2) cache contains only the h[ ] bit. An alternate organization that does
not require the h[ ] bits to be stored in the L2 cache will be discussed in Section 6. For
simplicity, Figure 1 and later discussions assume dynamic exclusion is applied to the L1
cache. Application to other cache levels is also possible.

To understand the function of the dynamic exclusion state bits, consider again the 
patterns shown in Section 3. Both the second and third patterns require the cache to retain 
an instruction while a conflicting instruction is executed. The sticky bit allows the cache to 
do this without having to exclude instructions forever. Whenever there is a hit, the sticky 
bit for that cache line is set. Normally, an instruction is held in the cache while the sticky 
bit is set. Whenever there is a miss, the sticky bit is reset. Thus, two sequential misses to a
cache line will always remove the resident instruction from the cache.

Figure 1: Direct-Mapped Cache with Dynamic Exclusion State Bits



The hit-last bit tells whether an instruction hit the last time it was in the cache. This 
allows some instructions to be brought immediately into the cache even when the sticky bit 
is set. Without the hit-last bit, the number of misses caused by switches between execution 
phases would double. 

The hit-last bit must be remembered while an instruction is not in the cache. In addition, 
the value associated with an instruction is only needed when that instruction misses. Thus, 
the hit-last bit is logically associated with a lower level in the memory hierarchy. For now, 
we will assume the hit-last bit is stored in the next level cache. Methods for treating misses 
in the next level cache are discussed in Section 6. 

Even though the value is not necessarily used, the hit-last bit could be set on every 
access to the first level cache. Unfortunately, the second level cache is normally too slow 
to be written on all these accesses. The hit-last bit in the first level cache shown in Figure 1 
avoids this problem. Whenever the hit-last bit needs to be set, the L1 copy is set. This copy
is then transferred to the L2 cache when the instruction in the L1 cache is replaced.

Figure 2 gives the state transition diagram for the dynamic exclusion FSM. Only the states
and transitions for two instructions, a and b, are shown. The behavior for other instructions
is symmetrical. The notation A and B indicates that instruction a or b, respectively, is in the cache.
s and !s indicate whether the sticky bit associated with the current cache line is set or reset
respectively. The notation h[x] refers to the hit-last bit associated with instruction x. As
discussed above, h[x] refers to the L1 bit when the value is changed and the L2 bit when the
value is used. The semicolons in arc labels separate the input conditions on the left from
state assignments on the right. As in the language C, assignment is represented by an equal
sign.

Figure 2: Dynamic Exclusion State Diagram

The state diagram can be understood by examining its behavior for the three common
instruction reference patterns described in Section 3. Consider first the conflict between
loops in the pattern $(a^{10} b^{10})^{10}$. The initial states of the sticky bit, the hit-last bits, and
the current instruction in the cache are unknown. However, for all possible initial states, 
instruction a will be loaded into the cache after at most two misses. During sequential 
executions of a, the cache will remain in state A,s and the hit-last bit h[a] will be set since
a will hit several times. When b is executed the initial action depends on the initial state of 
h[b]. However, after at most two misses, b will be loaded into the cache and subsequently 
h[b] will be set. 

At this point, both h[a] and h[b] are set, indicating that both a and b are repeated 
whenever they are executed. Subsequently, the cache behavior is the same as an optimal 
direct-mapped cache. Whenever either a or b is executed, the instruction will either already 
be in the cache or be immediately loaded just as an optimal cache would. Thus, for this 
pattern, a direct-mapped cache with dynamic exclusion has at most two more misses than 
an optimal direct-mapped cache depending on the initial state. 

In the conflict between loop levels pattern $(a^{10} b)^{10}$, a direct-mapped cache with dynamic
exclusion again matches the behavior of an optimal direct-mapped cache, perhaps after some 
initial training time. The initial executions of instruction a are identical to the earlier pattern. 
After at most three executions, a will be in the cache and h[a] will be set. Instruction b 
will be loaded only if h[b] is initially set. However, even in this case, h[b] is reset and the 
sticky bit will keep b from ever being loaded again. Again, a direct-mapped cache with 
dynamic exclusion has at most two more misses than an optimal direct-mapped cache. This 
is significantly better than a normal direct-mapped cache where each execution of b causes 
two misses. 

Figure 3: SPEC Benchmarks Used for Evaluation

Finally, in the pattern of conflict within a loop $(ab)^{10}$, a direct-mapped cache with
dynamic exclusion again acts like an optimal direct-mapped cache after some initial activity
to correctly set the h[ ] and s bits. Depending on the initial conditions, either instruction
a or b may be kept in the cache. However, eventually one instruction will be kept in. For
example, if a is initially in the cache, the finite state machine will cycle between states A,s
and A,!s. Again, after some initial misses the direct-mapped cache with dynamic exclusion
has only about half the misses of a normal direct-mapped cache.
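
The walkthrough of the three patterns can be reproduced with a minimal Python sketch of the single-bit machine for one cache line. Several points are assumptions rather than details confirmed by the text: the transitions follow the description in this section and the one-bit case of Figure 6, the class and function names are hypothetical, the L1 and L2 copies of the hit-last bits are merged into one dictionary, and the line starts empty with all bits reset unless stated otherwise.

```python
class ExclusionLine:
    """One direct-mapped cache line with a sticky bit and hit-last bits."""

    def __init__(self, hit_last=None):
        self.resident = None                   # assumed empty at start
        self.sticky = False                    # the s bit
        self.hit_last = dict(hit_last or {})   # h[x]; absent means reset

    def access(self, x):
        """Reference instruction x; return True on a hit."""
        if x == self.resident:                 # hit: set s and h[x]
            self.sticky = True
            self.hit_last[x] = True
            return True
        if not self.sticky:                    # A,!s: load x, set s, h[x]=1
            self.resident, self.sticky = x, True
            self.hit_last[x] = True
        elif self.hit_last.get(x, False):      # A,s with h[x]: load x, h[x]=0
            self.resident = x
            self.hit_last[x] = False
        else:                                  # A,s with !h[x]: exclude x, reset s
            self.sticky = False
        return False


def count_misses(refs, line=None):
    """Number of misses for a reference sequence on one cache line."""
    line = ExclusionLine() if line is None else line
    return sum(not line.access(x) for x in refs)


between_loops = (["a"] * 10 + ["b"] * 10) * 10
loop_levels = (["a"] * 10 + ["b"]) * 10
within_loop = ["a", "b"] * 10

print(count_misses(between_loops))   # 21, one more than the optimal 20
print(count_misses(loop_levels))     # 11, equal to the optimal
print(count_misses(within_loop))     # 11, equal to the optimal
# Unfavorable start: h[b] is already set, so b is loaded once by mistake.
print(count_misses(loop_levels, ExclusionLine(hit_last={"b": True})))   # 12
```

In each case the sketch stays within two misses of the optimal counts, consistent with the bounds stated above.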

Before going further, we should note that the state diagram in Figure 2 contains a small 
exception to the definition of h[ ]. The bit does not always mean that the relevant instruction 
hit the last time in the L1 cache. For example, in the transition A,!s -> B,s, the h[a]
bit is set even though there is no a hit. This aberration improves the cache performance by 
allowing more random references to get in the cache sooner. This is especially useful for 
data and mixed caches. 

To evaluate the effectiveness of dynamic exclusion, we will use the SPEC benchmarks 
shown in Figure 3. These benchmarks include a mix of symbolic and numeric applications. 
However, to limit cache simulation time, only the first 10 million references from each 
benchmark were used. Results using the full reference streams are similar. 

Figure 4 shows the instruction cache performance for each of the SPEC benchmarks 
for a normal direct-mapped cache, a direct-mapped cache with dynamic exclusion, and 
an optimal direct-mapped cache, all at a cache size of 32KB. The improvement varies 
between benchmarks largely depending on the relative frequency of the patterns discussed 
in Section 3. All the benchmarks with a high instruction cache miss rate show a significant 
improvement. Benchmarks nasa7 and tomcatv show a slight increase in misses with 
dynamic exclusion. This is caused by a small increase in cold-start misses while the 
dynamic exclusion state bits are initialized. For the full instruction reference streams, this 
increase is negligible. 

Figure 4: Instruction Cache Performance for Various Benchmarks (S=32KB)

Figure 5 shows the average instruction cache miss rate across the SPEC benchmarks for
a range of cache sizes with the same three types of caches as in Figure 4. Figure 7 shows
the percentage reduction from the normal direct-mapped cache miss rate. For very large
caches, the potential improvement decreases since there are no conflicts when the programs 
fit in the cache. For very small cache sizes, the conflicts are more likely to involve more 
than two instructions and thus not be recognizable by the FSM in Figure 2. The potential 
improvement is also smaller as the degree of conflicts increases. This is reflected by the 
decline in the percentage improvement for an optimal direct-mapped cache for these small 
cache sizes. The average improvement for dynamic exclusion peaks at 35% at a cache size 
of 32KB. 



5 Additional Sticky State 

So far, we have assumed that all conflicts for cache space involve two instructions. Clearly, 
conflicts among three or more instructions are also possible. For example, if the outer 
loop in Section 3.2 has two instructions that map to the same cache location, the execution 
sequence could be $(a^{10} b c)^{10}$. Similarly, if the loop in Section 3.3 has four instructions that
map to one cache location the execution sequence would be $(abcd)^{10}$.

Figure 5: Instruction Cache Dynamic Exclusion Performance for Various Cache Sizes

In both these new sequences, an optimal direct-mapped cache locks instruction a in the
cache and keeps the other instructions out. Unfortunately, the finite state machine presented
in Section 4 cannot recognize these patterns because they require an instruction to be kept
in the cache despite multiple misses. However, we can easily add this ability by adding
sticky bits. The resulting finite state machine is shown symbolically in Figure 6. Here, s
is a sticky counter. If the counter has only one bit, then Figure 6 is equivalent to the state
diagram in Figure 2. However, with a two bit counter the finite state machine in Figure 6
can recognize the more complex sequences discussed above.

The results using a two bit counter are shown in Figure 7. The two bit counter has better 
performance for small cache sizes. Here, typical patterns tend to have more instructions 
because a given loop is more likely to reuse the same cache locations. The two bit counter 
has worse performance for larger cache sizes. This is because patterns involving more than 
two instructions are relatively infrequent and a two bit counter takes longer to initialize. 
In addition, the two bit counter can confuse some patterns involving two instructions. For
example, it might confuse pattern $(abcd)^{10}$ with $(ab^{3})^{10}$ and keep a in the cache where
keeping b would have much better performance.
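
The trade-off between one and two sticky bits can be illustrated with a small Python sketch of the counter-based machine. Figure 6 does not spell out how s changes on every arc, so the sketch assumes that an excluded miss decrements the counter and that a hit or a load resets it to its maximum; it also assumes a maximum of 3 for a two bit counter. The function name is hypothetical.

```python
def misses_with_sticky_counter(refs, s_max):
    """Misses on one cache line using a sticky counter with maximum s_max."""
    resident, s, h = None, 0, {}
    misses = 0
    for x in refs:
        if x == resident:              # hit: refresh the counter, set h[x]
            s, h[x] = s_max, True
            continue
        misses += 1
        if s == 0:                     # counter exhausted: load x, h[x]=1
            resident, s, h[x] = x, s_max, True
        elif h.get(x, False):          # x hit last time: load x, h[x]=0
            resident, s, h[x] = x, s_max, False
        else:                          # assumed: exclude x and count down
            s -= 1
    return misses


abc = (["a"] * 10 + ["b", "c"]) * 10       # (a^10 b c)^10
abcd = ["a", "b", "c", "d"] * 10           # (abcd)^10
ab3 = ["a", "b", "b", "b"] * 10            # (a b^3)^10

for name, refs in [("a^10 b c", abc), ("abcd", abcd), ("a b^3", ab3)]:
    one_bit = misses_with_sticky_counter(refs, s_max=1)
    two_bit = misses_with_sticky_counter(refs, s_max=3)
    print(f"({name})^10: one bit {one_bit}, two bits {two_bit}")
```

Under these assumptions the two bit counter cuts the misses from 30 to 21 and from 40 to 31 on the first two patterns, but raises them from 13 to 31 on the third, where it holds a in the cache although b is the better choice.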

state A,s>0 {
    a       -> state A,s=s_max; h[a] = 1;
    h[b],b  -> state B,s=s_max; h[b] = 0;
    !h[b],b -> state A,s;
}

state A,s==0 {
    a -> state A,s; h[a] = 1;
    b -> state B,s; h[b] = 1;
}

Figure 6: Dynamic Exclusion Finite State Machine

Figure 7: Instruction Cache Performance Improvement for Various Cache Sizes

6 Choices for Lower Level Caches

In Section 4, we assumed that the second level cache was large. With this assumption, the
hit-last bits flushed out of the L1 cache can normally be found in the L2 cache. In this
section, we consider smaller L2 caches where there is more interference between hit-last
bits. In particular, we will consider three ways a cache with dynamic exclusion could
respond to an L2 miss:

1. use the existing hit-last bit (hashed)

2. assume the hit-last bit is set (assume-hit) 

3. assume the hit-last bit is not set (assume-miss) 

The first option has a significant structural advantage. The hit-last bits previously kept
in the L2 cache can be kept in the L1 cache since there is no need to ensure that the current
instruction matches the instruction stored in L2. This avoids the need to communicate the
hit-last information between the caches and even the need for the L2 cache to know that the
L1 cache is using dynamic exclusion. Also, there is no need for the original hit-last bit in
L1 since the bits previously stored in L2 can be accessed directly at L1 speeds.

Figure 8 shows the L1 miss rates with dynamic exclusion using each of the three options
as the L2 cache size is increased. For most L2 cache sizes, the assume-hit option has slightly
fewer L1 misses. Assuming instructions will hit is usually correct. However, if the L2
cache is the same size as the L1 cache, the assume-hit option gives no improvement since
the cache degenerates to conventional direct-mapped behavior. With all three schemes,
most of the performance is achieved as long as the L2 cache is at least 4 times as large as
the L1 cache. This is large enough to ensure that most L1 misses are found in the L2 cache.
This also implies that the hashing strategy needs only four hit-last bits for each cache line
to get good performance.
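
One possible reading of the hashed option is sketched below in Python. It is not a description of the simulated hardware: the class name, the flat table layout, and the choice of low-order address bits as the hash are all assumptions. The sketch keeps four untagged hit-last bits per cache line, as the sizing argument above suggests, and applies the single-bit machine of Section 4 to each line.

```python
class HashedExclusionCache:
    """Direct-mapped cache with dynamic exclusion and untagged hit-last bits."""

    def __init__(self, num_lines, bits_per_line=4):
        self.tags = [None] * num_lines
        self.sticky = [False] * num_lines
        # Assumed layout: one flat table, bits_per_line entries per cache line.
        self.hit_last = [False] * (num_lines * bits_per_line)

    def hit_index(self, addr):
        # Assumed hash: low-order address bits, with no tag comparison.
        return addr % len(self.hit_last)

    def access(self, addr):
        """Reference instruction address addr; return True on a hit."""
        line = addr % len(self.tags)
        h = self.hit_index(addr)
        if self.tags[line] == addr:            # hit
            self.sticky[line] = True
            self.hit_last[h] = True
            return True
        if not self.sticky[line]:              # not sticky: load
            self.tags[line], self.sticky[line] = addr, True
            self.hit_last[h] = True
        elif self.hit_last[h]:                 # hashed bit set: load
            self.tags[line] = addr
            self.hit_last[h] = False
        else:                                  # exclude and reset sticky
            self.sticky[line] = False
        return False


cache = HashedExclusionCache(num_lines=4)
refs = [0, 4] * 10          # addresses 0 and 4 share cache line 0
misses = sum(not cache.access(a) for a in refs)
print(misses)               # 11
```

With four lines, addresses 0 and 4 conflict for the same line but use different hit-last bits, so the alternating pattern behaves as in Section 4. Addresses 0 and 16 would share a hit-last bit as well, which is the interference that the hashed option accepts.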

Figures 9 and 10 show the results of the three options on the L2 miss rates. With the
hashed and assume-miss strategies, not every instruction stored in the L1 cache needs to
be stored in the L2 cache. This allows the L2 cache to store other instructions and get a
lower miss rate. This could be particularly useful if the L2 cache is on the same chip as the
CPU and needs to be kept small. As the figures show, the assume-miss strategy is best
at improving the L2 miss rate. This is because it maximizes the difference between the two
caches. The hashed strategy also has a significant improvement. The assume-hit strategy
does not help the L2 cache because everything in the L1 cache will also be in the L2 cache.

7 Caches with Longer Line Sizes 

In the discussion so far, we have assumed each cache line contains one instruction. Larger 
cache lines present two problems. First, the sequence of instructions that use each tag 
is much different. If the dynamic exclusion state bits are updated whenever a given tag 
is accessed, the patterns described in Section 3 will no longer be seen. Using the same 
finite state machine, cache lines would rarely be excluded because there are almost always 
several instructions executed sequentially that use the same tag. The second problem is 
that even if the FSM excluded whole cache lines, this would result in poor performance as 
each sequential instruction generated a new miss. 

Figure 8: Dynamic Exclusion L1 Performance for Various L2 Cache Sizes (L1=32KB)

Figure 9: Dynamic Exclusion L2 Performance for Various L2 Cache Sizes (L1=32KB)

Figure 10: Dynamic Exclusion L1 Performance Improvement for Various L2 Cache Sizes (L1=32KB)

We can solve both these problems once we recognize that the pattern of references to 
each position within a cache line is essentially the same. Likewise, if we treat the sequential 
references to each cache line as one reference, the sequence of these references is essentially 
the same as the sequence of references to each instruction within the line. Moreover, the 
pattern of these line references is the same as the patterns discussed in Section 3. When each
cache line held one instruction, we were able to reduce the number of misses by excluding
instructions that are only used once before a competing instruction is executed. A similar 
improvement is possible with larger line sizes if we recognize lines that will only be used 
for sequential instruction executions. These lines should be excluded from the cache, but 
they do need to be held somewhere so that only one miss is required for the sequential 
references in the line. This allows dynamic exclusion to be used without losing the benefits 
of spatial locality. 

There are two particularly simple methods of implementing exclusion of longer cache 
lines: 

1. use an instruction register the same size as the L1 cache line.

2. logically add a special line to the cache with its own tag to hold the most recently 
referenced line. 

In the first alternative, missing lines are always stored in the instruction register. However,
the line is only stored in the L1 cache if the dynamic exclusion FSM suggests it should
be. Sequential requests for the next instructions are taken from the instruction register
without changing the dynamic exclusion state. It is only necessary to access the cache
again when there is a taken branch or the program counter rolls over to the next line.

Figure 11: Dynamic Exclusion Structure with Longer Cache Lines

The structure for the second alternative is shown in Figure 11. All missing lines are 
stored in the special last-line area. Subsequent sequential references are taken from the 
last-line when there is a match to the last-tag field. Lines are only stored to the L1 cache
as directed by the dynamic exclusion FSM. Finally, the dynamic exclusion state is only 
changed when the current instruction address does not match last-tag. 
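
The last-line scheme can be sketched in Python as follows. This is a simplified model under stated assumptions, not the simulated design: the function name is hypothetical, addresses count instructions rather than bytes, the last-line buffer is assumed to track the most recently referenced line on hits as well as misses, and the hit-last bits are kept in a single dictionary.

```python
def misses_with_last_line(addresses, num_lines, words_per_line):
    """Count line misses with dynamic exclusion plus a last-line buffer."""
    tags = [None] * num_lines
    sticky = [False] * num_lines
    hit_last = {}                      # h[] per line address (assumed storage)
    last_tag = None                    # tag of the line held in last-line
    misses = 0
    for addr in addresses:
        block = addr // words_per_line
        if block == last_tag:
            continue                   # served from last-line; FSM state unchanged
        last_tag = block               # last-line now holds this line
        idx = block % num_lines
        if tags[idx] == block:         # L1 hit
            sticky[idx] = True
            hit_last[block] = True
            continue
        misses += 1
        if not sticky[idx]:            # load the line into L1
            tags[idx], sticky[idx] = block, True
            hit_last[block] = True
        elif hit_last.get(block, False):
            tags[idx] = block
            hit_last[block] = False
        else:                          # exclude: the line lives only in last-line
            sticky[idx] = False
    return misses


# Two four-instruction lines that map to the same cache line, executed
# alternately ten times: line references follow the (ab)^10 pattern.
refs = (list(range(0, 4)) + list(range(4, 8))) * 10
print(misses_with_last_line(refs, num_lines=1, words_per_line=4))   # 11
```

Because sequential references inside a line never reach the FSM, the 80 instruction references are seen as 20 line references, and the excluded line costs one miss per pass instead of four.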

Figure 12 shows the performance of the second scheme as the cache line size increases 
with a 32KB instruction cache. The percentage improvement in the miss rate declines 
progressively from 35% with four byte lines to 23% at 64 byte lines. This loss in efficiency 
tracks the internal fragmentation problem of long cache lines. In particular, two instructions 
that do not conflict with a small line size may conflict with a longer line size. These 
added conflicts can prevent the FSM from finding a line that can be excluded to improve 
performance. 



8 Data and Mixed Caches 

The previous sections have only discussed instruction caches. A natural question is whether 
the same techniques can be applied to data caches as well. Figure 13 shows the result of applying
the single bit dynamic exclusion FSM to the data references from the SPEC benchmarks.
Again, only the first 10 million references were used to keep simulation time reasonable. 
For small cache sizes there is a small improvement. However, for larger cache sizes, the 
dynamic exclusion FSM has slightly worse performance than a direct-mapped cache. The 
common data reference patterns are different than those for instructions. In addition, a 
normal direct-mapped cache is closer to optimal for data references than for instruction 
references. Thus, there is less potential for dynamic exclusion to help. 

Figure 14 shows the performance of dynamic exclusion for various sizes of combined 
data and instruction caches with a line size of 4 bytes. For smaller cache sizes, the 
improvement is nearly as large as the improvement for instruction caches. For large caches, 
the improvement is smaller. With these benchmarks, instruction references dominate the 
miss rate for small caches and data references dominate for large caches. 

Figure 12: Instruction Cache Dynamic Exclusion Performance for Various Line Sizes (S=32KB)

Figure 13: Data Cache Dynamic Exclusion Performance for Various Cache Sizes

Figure 14: Dynamic Exclusion Performance for Various Combined I and D Cache Sizes

9 Conclusions 

This paper has presented a new technique named dynamic exclusion that reduces the miss 
rate of direct-mapped caches. The technique uses a small finite state machine to recognize 
the common instruction reference patterns. By keeping instructions that will not hit anyway 
out of the cache, the remaining instructions have fewer misses. The reduction in miss rate 
depends on the benchmark and the cache size. However, for a 32KB instruction cache, 
the average miss rate for the SPEC benchmarks is reduced by 35%. In addition, dynamic 
exclusion can improve the performance of the second level cache since some instructions 
only need to be stored in the first level cache. Finally, by reducing the number of instruction 
misses, dynamic exclusion is also useful for combined instruction and data caches. 